French Multi Word Expressions: Using Data on Different Patterns for Extraction and Validation
Author
Abstract
Multi Word Expressions (MWEs) are an important problem in NLP. Many researchers use association measures to collect and evaluate MWE candidates. In this paper we examine whether it is legitimate to use those measures when data are collected on only one MWE pattern (e.g. Noun-Adjective) to evaluate candidates belonging to another pattern (e.g. Noun-Noun). For this purpose, we run tests on the French Europarl corpus. Using association measures extracted from Noun-Adjective patterns as features, we train a model that we evaluate on instances of Noun-Noun candidates. With this method, the model still evaluates a quarter of the candidates correctly. However, the results tend to be lower.
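The cross-pattern setup described in the abstract can be illustrated with a minimal Python sketch. The abstract does not name the association measures or the model used, so everything here is an assumption made for illustration: pointwise mutual information (PMI) and the Dice coefficient stand in for the association measures, a single score threshold stands in for the trained model, and the word pairs and counts are invented toy data, not taken from Europarl.

```python
import math
from collections import Counter


def association_features(pairs):
    """Compute PMI and Dice for every distinct (w1, w2) pair.

    `pairs` is a list of candidate tokens, one (w1, w2) tuple per occurrence.
    PMI and Dice are illustrative choices, not necessarily the paper's measures.
    """
    pair_counts = Counter(pairs)
    left_counts = Counter(w1 for w1, _ in pairs)
    right_counts = Counter(w2 for _, w2 in pairs)
    total = len(pairs)
    features = {}
    for (w1, w2), c12 in pair_counts.items():
        c1, c2 = left_counts[w1], right_counts[w2]
        features[(w1, w2)] = {
            "pmi": math.log2(c12 * total / (c1 * c2)),
            "dice": 2 * c12 / (c1 + c2),
        }
    return features


def fit_threshold(features, labels, measure="dice"):
    """Hypothetical stand-in for model training: midpoint between the mean
    score of true MWEs and the mean score of non-MWEs."""
    pos = [f[measure] for c, f in features.items() if labels[c]]
    neg = [f[measure] for c, f in features.items() if not labels[c]]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(features, threshold, measure="dice"):
    """Label a candidate as an MWE when its score reaches the threshold."""
    return {c: f[measure] >= threshold for c, f in features.items()}


# Toy Noun-Adjective occurrences with invented gold labels (training pattern).
noun_adj = [("carte", "bleue")] * 2 + [("carte", "grande"), ("maison", "grande")]
noun_adj_labels = {
    ("carte", "bleue"): True,
    ("carte", "grande"): False,
    ("maison", "grande"): False,
}

# Toy Noun-Noun occurrences (evaluation pattern, never seen in training).
noun_noun = [("mot", "clé")] * 3 + [("mot", "phrase")]

threshold = fit_threshold(association_features(noun_adj), noun_adj_labels)
predictions = predict(association_features(noun_noun), threshold)
```

The point of the sketch is only the data flow: the decision rule is fitted on scores from one pattern and then applied unchanged to candidates of another pattern, which is the transfer whose legitimacy the paper sets out to test.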
Similar resources
Project proposal Automatic extraction and evaluation of MWE: adapting method to French Language Technology: Research and Development
Our project is based on the theme of Multi Word Expressions (MWE); we will focus on the problem of extraction. This task is important for improving lexical resources used for tasks such as tokenization, parsing or translation. In our study we will work on a French corpus. Our aim will be not only to select but also to validate automatically which candidates are the true ones. If we have time we wil...
Extraction of Nominal Multiword Expressions in French
Multiword expressions (MWEs) can be extracted automatically from large corpora using association measures, and tools like mwetoolkit allow researchers to generate training data for MWE extraction given a tagged corpus and a lexicon. We use mwetoolkit on a sample of the French Europarl corpus together with the French lexicon Dela, and use Weka to train classifiers for MWE extraction on the gener...
Corpus-Driven Study of Multi-Word Expressions Based on Collocations from a Very Large Corpus
We present a corpus-driven approach to the study of multi-word expressions, which constitute a significant part of. As a data basis, we use collocation profiles computed from DeReKo (Deutsches Referenzkorpus), the largest available collection of written German which has approximately two billion word tokens and is located at the Institute for the German Language (IDS). We employ a strongly usag...
Feature selection using genetic algorithm for classification of schizophrenia using fMRI data
In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...
Towards a mixed approach to extract biomedical terms from documents
The proposed work aims at automatically extracting biomedical terms from free text. We present new extraction methods taking into account linguistic patterns specialized for the biomedical field, statistical term extraction measures such as C-value, and statistical keyword extraction measures such as Okapi BM25 and TFIDF. These measures are combined in order to improve the extraction process and we...
Publication date: 2013